Fine-grained file differences

The diff utility compares files by lines, which is often what you’d like it to do. But sometimes you’d like more granularity.

For example, suppose we want to compare two versions of Psalm 23. Here are the first three verses in the King James version:

The Lord is my shepherd; I shall not want.
He maketh me to lie down in green pastures:
he leadeth me beside the still waters.
He restoreth my soul:
he leadeth me in the paths of righteousness
for his name’s sake.

And here are the corresponding lines from a more contemporary translation, the English Standard Version:

The Lord is my shepherd; I shall not want.
He makes me lie down in green pastures.
He leads me beside still waters.
He restores my soul.
He leads me in paths of righteousness
for his name’s sake.
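One way to create the two files used in the rest of the post, ps23.kjv and ps23.esv, is with shell here-documents. This is only a sketch: it writes into the current directory, and it uses a plain ASCII apostrophe in name's, which matches the git output shown further down.

```shell
# Write the King James lines to ps23.kjv.
# Quoting 'EOF' keeps the shell from expanding anything inside the text.
cat > ps23.kjv <<'EOF'
The Lord is my shepherd; I shall not want.
He maketh me to lie down in green pastures:
he leadeth me beside the still waters.
He restoreth my soul:
he leadeth me in the paths of righteousness
for his name's sake.
EOF

# Write the English Standard Version lines to ps23.esv.
cat > ps23.esv <<'EOF'
The Lord is my shepherd; I shall not want.
He makes me lie down in green pastures.
He leads me beside still waters.
He restores my soul.
He leads me in paths of righteousness
for his name's sake.
EOF
```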

Save these in two files, ps23.kjv and ps23.esv. If we run

diff ps23.kjv ps23.esv

we get

2,5c2,5
< He maketh me to lie down in green pastures:
< he leadeth me beside the still waters.
< He restoreth my soul:
< he leadeth me in the paths of righteousness
---
> He makes me lie down in green pastures.
> He leads me beside still waters.
> He restores my soul.
> He leads me in paths of righteousness

This says that the two files differ in lines 2 through 5; the first and last lines are identical. The output shows lines 2 through 5 from each file but doesn’t show how they differ.

To see more fine-grained differences, such as changing maketh to makes, we can run the version of diff that comes with git.
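One caveat if you try this inside a git working tree: given two paths that both live in the tree, git diff treats them as path filters on the repository rather than as two files to compare. The --no-index flag asks for the file-versus-file comparison explicitly. A small sketch, using throwaway files old.txt and new.txt (names made up for illustration):

```shell
# Work in a scratch directory so nothing of yours gets overwritten.
cd "$(mktemp -d)"
printf 'He maketh me to lie down in green pastures:\n' > old.txt
printf 'He makes me lie down in green pastures.\n' > new.txt

# --no-index compares two files on disk, whether or not we are in a repository.
# It also implies --exit-code: status 1 means "the files differ", not "error".
status=0
git diff --no-index --word-diff old.txt new.txt || status=$?
echo "exit status: $status"
```

The exit status matters mostly in scripts, where a nonzero status from git diff could otherwise be mistaken for a failure.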

If we run

git diff --word-diff ps23.kjv ps23.esv

we can compare the files by words rather than by lines. This produces

diff --git a/ps23.kjv b/ps23.esv
index b90b858..be2a1a8 100644
--- a/ps23.kjv
+++ b/ps23.esv
@@ -1,6 +1,6 @@
The Lord is my shepherd; I shall not want.
He [-maketh-]{+makes+} me[-to-] lie down in green [-pastures:-]
[-he leadeth-]{+pastures.+}
{+He leads+} me beside[-the-] still waters.
He [-restoreth-]{+restores+} my [-soul:-]
[-he leadeth-]{+soul.+}
{+He leads+} me in[-the-] paths of righteousness
for his name's sake.

The colors help make the text more readable, assuming you can see the difference between red and green. The color scheme is configurable through git's color.diff settings. But the text is readable without the color highlighting. For example, in the first changed line we have

[-maketh-]{+makes+}

which means we remove the word maketh and add the word makes.
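The bracket notation is meant for reading. For scripts, git also has a line-based variant of the same option, --word-diff=porcelain, which puts each removed or added run on its own line. A rough sketch of pulling out just the changes, again with made-up file names old.txt and new.txt:

```shell
cd "$(mktemp -d)"
printf 'He maketh me to lie down\n' > old.txt
printf 'He makes me lie down\n' > new.txt

# Porcelain mode prints each removed run on a line starting with "-",
# each added run on a line starting with "+", and unchanged text after a space.
# "|| true" because git diff --no-index exits with status 1 when files differ.
words=$(git diff --no-index --word-diff=porcelain old.txt new.txt || true)

# Keep only removed and added runs, skipping the "---" and "+++" header lines.
changes=$(printf '%s\n' "$words" | grep -E '^[-+]' | grep -vE '^(---|\+\+\+) ')
printf '%s\n' "$changes"
```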

We can compare the files on an even finer level, comparing by characters rather than words. For example, rather than saying we need to change maketh to makes the software can say we need to change the th ending to s. We can do this by running

git diff --word-diff-regex=. ps23.kjv ps23.esv

The option --word-diff-regex=. says to use the regular expression . to define what counts as a word. Since the dot matches any single character, every character becomes its own word, and so the lines are compared character by character.

diff --git a/ps23.kjv b/ps23.esv
index b90b858..be2a1a8 100644
--- a/ps23.kjv
+++ b/ps23.esv
@@ -1,6 +1,6 @@
The Lord is my shepherd; I shall not want.
He make[-th-]{+s+} me [-to -]lie down in green pastures[-:-]
[-h-]{+.+}
{+H+}e lead[-eth-]{+s+} me beside [-the -]still waters.
He restore[-th-]{+s+} my soul[-:-]
[-h-]{+.+}
{+H+}e lead[-eth-]{+s+} me in [-the -]paths of righteousness
for his name's sake.

As before we have square brackets to indicate what to remove and curly braces to indicate what to add, but now we’re removing and adding letters rather than words.
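Between whole words and single characters there is a middle ground. Because the regex defines what counts as a word, one could treat runs of letters and digits as words and each punctuation mark as a word of its own, so that pastures: versus pastures. shows up as a change in punctuation only. The pattern below is just one possible choice, and old.txt and new.txt are made-up names:

```shell
cd "$(mktemp -d)"
printf 'He maketh me to lie down in green pastures:\n' > old.txt
printf 'He makes me lie down in green pastures.\n' > new.txt

# A "word" is a run of alphanumerics, or else any single non-space character.
# "|| true" because git diff --no-index exits with status 1 when files differ.
out=$(git diff --no-index --word-diff \
      --word-diff-regex='[[:alnum:]]+|[^[:space:]]' old.txt new.txt || true)
printf '%s\n' "$out"
```

With this pattern maketh and makes are still reported as whole-word changes, while the colon and the period are reported separately from the word they follow.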

We can get a more compact display of the differences if we rely on color alone, by adding the --word-diff=color option.

git diff --word-diff=color --word-diff-regex=. ps23.kjv ps23.esv

produces the following.

diff --git a/ps23.kjv b/ps23.esv
index b90b858..be2a1a8 100644
--- a/ps23.kjv
+++ b/ps23.esv
@@ -1,6 +1,6 @@
The Lord is my shepherd; I shall not want.
He makeths me to lie down in green pastures:
h.
He leadeths me beside the still waters.
He restoreths my soul:
h.
He leadeths me in the paths of righteousness
for his name's sake.

Equivalently, we can combine the two options

--word-diff=color --word-diff-regex=.

into the one option

--color-words=.

which passes the word regular expression as the argument to --color-words.
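A quick way to check the equivalence, sketched with throwaway files (old.txt and new.txt are made-up names), is to save both outputs and compare them byte for byte:

```shell
cd "$(mktemp -d)"
printf 'He maketh me to lie down\n' > old.txt
printf 'He makes me lie down\n' > new.txt

# Both commands exit with status 1 because the files differ, hence "|| true".
git diff --no-index --word-diff=color --word-diff-regex=. old.txt new.txt > long.out || true
git diff --no-index --color-words=. old.txt new.txt > short.out || true

# cmp -s is silent and succeeds only when the two outputs are identical.
if cmp -s long.out short.out; then same=yes; else same=no; fi
echo "identical output: $same"
```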

This may be the most convenient way to see the differences, provided you can distinguish the colors and don’t need to use the plain text programmatically. Without the colors, make[-th-]{+s+}, for example, becomes simply makeths and we can no longer be sure what changed.
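As for the color scheme, git reads the colors for removed and added text from its color.diff.old and color.diff.new settings. Here is a sketch that swaps red and green for blue and yellow for a single command. The color pair is arbitrary, chosen only for illustration, and old.txt and new.txt are made-up names.

```shell
cd "$(mktemp -d)"
printf 'He maketh me to lie down\n' > old.txt
printf 'He makes me lie down\n' > new.txt

# -c sets a configuration value for this one invocation only.
# color.diff.old colors removed text, color.diff.new colors added text.
# "|| true" because git diff --no-index exits with status 1 when files differ.
colored=$(git -c color.diff.old='blue bold' -c color.diff.new='yellow bold' \
          diff --no-index --color-words=. old.txt new.txt || true)
printf '%s\n' "$colored"

# To make the choice permanent instead, one could run:
#   git config --global color.diff.old 'blue bold'
#   git config --global color.diff.new 'yellow bold'
```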

4 thoughts on “Fine-grained file differences”

  1. Coincidentally I just started using those git diff options a couple of months ago.

    I find that looking at the git diffs in the three different ways: ‘vanilla’, ‘word-diff’ and ‘word diff+regex’ is extremely useful for understanding both in-progress commits and old commits.

    They are so useful that I have git aliases for the three variants for both ‘git diff’ and ‘git show’ and then bash aliases for the git aliases: gd, gd2, gd3, gs, gs2, gs3.

    -John

  2. Any idea how to change the colors being used? Lots of people (1/12) won’t see the red on black or red/green difference.

  3. Nice. vimdiff is also handy for finding colorized fine-grained diffs like this, though it brings you into vim to see the results rather than simply outputting them.
